Applying the Weak Learning Framework to Understand and Improve C4.5
Authors
Abstract
There has long been a chasm between theoretical models of machine learning and practical machine learning algorithms. For instance, empirically successful algorithms such as C4.5 and backpropagation have not met the criteria of the PAC model and its variants. Conversely, the algorithms suggested by computational learning theory are usually too limited in various ways to find wide application. The theoretical status of decision tree learning algorithms is a case in point: while it has been proven that C4.5 (and all reasonable variants of it) fails to meet the PAC model criteria [2], other recently proposed decision tree algorithms that do have non-trivial performance guarantees unfortunately require membership queries [6, 13]. Two recent developments have narrowed this gap between theory and practice, not for the PAC model, but for the related model known as weak learning or boosting. First, an algorithm called Adaboost was proposed that meets the formal criteria of the boosting model and is also competitive in practice [10]. Second, the basic algorithms underlying the popular C4.5 and CART programs have also very recently been shown to meet the formal criteria of the boosting model [12]. Thus, it seems plausible that the weak learning framework may provide a setting for interaction between formal analysis and machine learning practice that is lacking in other theoretical models. Our aim in this paper is to push this interaction further in light of these recent developments. In particular, we perform experiments suggested by the formal results for Adaboost and C4.5 within the weak learning framework. We concentrate on two particularly intriguing issues. First, the theoretical boosting results for top-down decision tree algorithms such as C4.5 [12] suggest that a new splitting criterion may result in trees that are smaller and more accurate than those obtained using the usual information gain. We confirm this suggestion experimentally. Second, a superficial interpretation of the theoretical results suggests that Adaboost should vastly outperform C4.5. This is not the case in practice, and we argue through experimental results that the theory must be understood in terms of a measure of a boosting algorithm's behavior called its advantage sequence. We compare the advantage sequences for C4.5 and Adaboost in a number of experiments. We find that these sequences have qualitatively different behavior that explains in large part the discrepancies between empirical performance and the theoretical results. Briefly, we find that although C4.5 and Adaboost are both boosting algorithms, Adaboost creates successively "harder" filtered distributions, while C4.5 creates successively "easier" ones, in a sense that will be made precise.
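To make the splitting-criterion comparison concrete: the criterion analyzed in [12] replaces the binary entropy used by information gain with G(q) = 2*sqrt(q*(1-q)), where q is the fraction of positive examples reaching a node. The Python sketch below is our own illustration, not code from the paper; its function names and the example split are assumptions chosen only to show how a candidate split would be scored under each criterion.

import math

def entropy(q):
    # Binary entropy of the positive-class fraction q (the information-gain criterion).
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def gini(q):
    # Gini index used by CART, scaled here to peak at 1 for comparability.
    return 4 * q * (1 - q)

def sqrt_criterion(q):
    # The square-root criterion G(q) = 2*sqrt(q*(1-q)) analyzed in [12].
    return 2 * math.sqrt(q * (1 - q))

def split_gain(criterion, q_parent, q_left, q_right, w_left):
    # Impurity reduction when a node with positive-class fraction q_parent is split
    # into children with fractions q_left and q_right; w_left is the fraction of
    # the node's examples routed to the left child.
    children = w_left * criterion(q_left) + (1 - w_left) * criterion(q_right)
    return criterion(q_parent) - children

# Hypothetical example: a "weak" split whose children are only slightly purer than the parent.
q_parent, q_left, q_right, w_left = 0.50, 0.55, 0.45, 0.5
for name, crit in [("information gain", entropy), ("Gini", gini), ("sqrt [12]", sqrt_criterion)]:
    print(f"{name:16s} gain = {split_gain(crit, q_parent, q_left, q_right, w_left):.6f}")

A top-down learner in the style of C4.5 would, at each node, choose the split with the largest such gain; the results in [12] suggest that substituting the square-root criterion for information gain can produce smaller and more accurate trees, which is the suggestion confirmed experimentally in this paper.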
Similar papers
Enhancement Of A Chinese Discourse Marker Tagger With C4.5
Discourse markers are complex discontinuous linguistic expressions which are used to explicitly signal the discourse structure of a text. This paper describes efforts to improve an automatic tagging system which identifies and classifies discourse markers in Chinese texts by applying machine learning (ML) to the disambiguation of discourse markers, as an integral part of automatic text summariz...
Classification of SchoolNet Data
SchoolNet Data, mainly educational material, was authored by SchoolNet to make it easy for teachers and learners to find educational resources in various subjects. The task of automatically assigning subject categories to learning materials has become one of the key steps for organizing online information. Since hand-coding classification rules is costly or even impractical, most modern approac...
Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining
This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...
MMDT: Multi-Objective Memetic Rule Learning from Decision Tree
In this article, a Multi-Objective Memetic Algorithm (MA) for rule learning is proposed. Prediction accuracy and interpretation are two measures that conflict with each other. In this approach, we consider accuracy and interpretation of rules sets. Additionally, individual classifiers face other problems such as huge sizes, high dimensionality and imbalance classes’ distribution data sets. This...
Natural Induction and Conceptual Clustering: A Review of Applications
Natural induction and conceptual clustering are two methodologies pioneered by the GMU Machine Learning and Inference Laboratory for discovering conceptual relationships in data, and presenting them in the forms easy for people to interpret and understand. The first methodology is for supervised learning (learning from examples) and the second for unsupervised learning (clustering). Examples of...